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We present stochastic approximation algorithms for computing the locally opti- 
mal policy of a constrained, average cost, finite state Markov Decision Process. 
■ The stochastic approximation algorithms require estimation of the gradient of the 

cost function with respect to the parameter that characterizes the randomized pol- 
icy. We propose a spherical coordinate parameterization and implement a simu- 
lation based gradient estimation scheme using using potential realization factors. 
In this paper we analyse boundedness of moments of the estimators and provide 
a new analysis for the computational requirements of the estimators. We present 
a stochastic version of a primal dual (augmented) Lagrange multiplier method 
for the constrained algorithm. We show that the "large sample size" limit of the 
procedure is unbiased, but for small sample size (which is relevant for on-line 
real time operation) it is biased. We give an explicit expression of the asymptotic 
bias and discuss several possibilities for bias reduction. Numerical examples are 
given to illustrate the performance of the algorithms. In particular we present a 
case study in transmission control in wireless communications where the optimal 
policy has a simplified threshold structure and apply our stochastic approxima- 
tion method to this example. 



H 

' 1. Introduction 



Let S denote an arbitrary finite set called the state space. Let Ui, i € S denote an 
arbitrary collection of finite sets called action sets. A Markov Decision Process 
1124 1 (MDP) {X n } with finite state space S evolves as follows. When the system 
is in state i S S, a finite number of possible actions from the finite set Ui can 
be taken. Let u n denote the action taken by the decision maker at time n and let 
d(i) + 1 denote the cardinality of the action set Ui. The evolution of the system 
is Markovian with a transition probability matrix A(u) that depends on the action 
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u GUi, that is for i, j E S, 



A 



Aij(u) = F[X n+ i = j\X n = i, u n = u], ueUi, n = 0, 1, 



(1) 



Denote by 3Vi i n > 1 the cr-algebra generated by the observed system tra- 
jectory (Xq, . . . , X n , uo, . . . , it„_i) and set J as the cr-algebra generated by 
Xq. The filtration of the process is the increasing sequence of tr-algebras 
{■&n> n — * 0}. Define the set of admissible policies T> = {u = {u n } : 
u n is measurable w.r.t. $ n , Vn S N}. To avoid being distracted by technicalities, 
except for Sec|6] we assume that all elements of A(u) for each u are non-zero so 
that the resulting process is ergodic. In Sec|6] a wireless telecommunications ex- 
ample is presented and sufficient conditions are given so that every feasibly policy 
induces a recurrent Markov chain. 

The cost incurred at stage n is a known bounded function c(X n ,u n ) > 
where c : S x U — > M. For any admissible policy u 6 V, let E u denote the 
corresponding expectation and define the infinite horizon average cost 



J Xo (u) = lim sup— E u 



' N 

E 

71=1 



c(X n ,U n ) | Xq = Xq 



(2) 



Motivated by several problems in telecommunication network optimization such 
as admission control in wireless networks [33]!22][D3. we consider the cost (f2|, 
subject to L sample path constraints 



1 N 

lim sup —}6i(X n ,u n )< 



= 1, 1 = 1,2,. ..,L, 



where /?/ : S x U — > K are known bounded functions and 7/ are known constants. 
These are used, for example, in admission control of telecommunication networks 
to depict quality of service (QoS) constraints, see 113 3 1 1221 . For ergodic MDPs, 
these sample path constraints are equivalent to the average constraints 



lim sup — E u 

JV^oo N 



' N 

E 

,n=l 



(3i(X n ,u n ) 



< 



1 



L. 



(3) 



(4) 



The aim is to compute an optimal policy u* g V that satisfies 

J Xo (u*) = inf J Xo {u) Vx Q G 5, 

u(EX> 

that is, u* has the minimum cost for all initial states xq € S subject to the con- 
straints (0). It is well known (2) that if L > then the optimal policy u* is 
randomized for at most L of the states. If the transition probabilities Ay(ti) (fl} 
are known, then the optimal policy u* for the constrained MDP (f2]i, (01 is straight- 
forwardly computed as the solution of a linear programming problem. 
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Objectives. Problem (f2]), (J3]l is an adaptive constrained Markov decision process 
problem. This paper presents stochastic approximation algorithms for adaptively 
computing the optimal policy u* of the above constrained MDP (f2]), ([3]). 

There are two methodologies that are used in the literature for solving stochas- 
tic adaptive control problems: direct methods, where the unknown transition prob- 
abilities Aij{u) are estimated simultaneously while updating the control policy, 
and implicit methods - such as simulation based methods, where the transition 
probabilities are not directly estimated in order to compute the control policy. In 
this paper, we focus on implicit simulation-based algorithms for solving the MDP 
©, l|3}. By simulation based we mean that although the transition probabilities 
A(u), u £ Ui, i £ S are unknown, the decision maker can observe the system 
trajectory under any choice of control actions u = {u n }. In other words, the 
adaptive control algorithms we present are adapted to the filtration {$ n ; n > 0} 
defined above. Moreover, these algorithms can deal with slowly time varying 
transition probabilities A(u). 

Neurodynamic programming methods |7| such as Q-learning and temporal 
difference methods are also examples of simulation based implicit methods that 
have been widely used to solve unconstrained MDPs where the optimal policy u* 
is a pure policy, that is, it* is a deterministic function of X n . However, for the 
constrained MDP (0, (01, since the optimal policy is randomized, there seems to 
be no obvious way of modifying such methods. 

Summary of Main results In this paper we present a stochastic version of 
gradient-based optimisation methods. In Section 0] we derive a consistent weak 
derivative (WD) "phantom" estimator. This gradient estimator does not require 
explicit knowledge of the transition probabilities A(u) - so for brevity we call it 
a "parameter free" gradient estimator. 

Section I4.2I presents the implementation of our WD estimators over short 
batch lengths to facilitate real time control of the MDP. The key advantage 
is that they are amenable to adaptive control of MDPs with time varying pa- 
rameters. Moreover, these gradient estimators yield several orders of mag- 
nitude reduction in variance compared to the score function method in [4; 
0. However, as we discuss in Section 1431 a non negligible asymptotic bias may 
occur. We derive the statistical properties (consistency, bias for small batch size, 
and efficiency) of the gradient estimates. We give stochastic bounds for the num- 
ber of parallel phantoms, which allows us to approximate the moments of the 
coupling time as well as the computational complexity of the algorithm. To our 
knowledge, these results, which are applicable to realisation perturbation factors, 
are new. 
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Putting the parameter free gradient estimation algorithm into a stochastic gra- 
dient algorithm results in a new simulation based algorithm for the adaptive con- 
trol of a constrained MDP. The control problem (]4]i under constrains (O can be 
reformulated as an optimization problem using a parametrization, as will be stated 
in Section 2. The main challenge that we address in this paper is as follows: Given 
N observations of the process under a given policy parametrized by the control 
parameter a(n), say Z( n _i)jv", • ■ ■ , Z n N, and without explicit knowledge of the 
transition probabilities Pjj (u), adjust the control variable a(n+ 1) to approximate 
the optimal control. We assume that the actual instantaneous cost and constraints 
values are observed exactly. In the stationary case where the transition probabili- 
ties, cost and constraint functions do not change, one usually constructs algorithms 
such that \\a(n) — a*\\ is "small" in some sense, usually with probability one. For 
tracking however, we focus on an algorithm with adaptive capabilities that will get 
close to the current optimum, while at the same time, it reacts to changes in the 
environment. There are two important results that we present in this context. 

First, because the constraints in the MDP are in the form of long term aver- 
ages that are not known to the decision maker, it is not possible to use stochas- 
tic approximation algorithms with the usual gradient projection methods in 111 81 . 
for example. We present primal dual based stochastic approximation algorithms 
with a penalty function and augmented Lagrangian (multiplier) algorithms are 
presented for solving the time varying constrained MDP. Weak convergence of 
the action probability estimates to the optimal action probabilties is established 
when the sample size for the estimation is large. However, for the short batch 
length implementation, we show that the algorithm is asymptotically suboptimal 
and we characterise the asymptotic bias Sec l5.3l illustrates the performance of the 
algorithms on a constrained MDP with time varying transition probabilities. 

Second, because the action probabilities must always add up to one and be 
non negative, there is strong motivation to develop a parametrization that ensures 
feasibility of the estimates generated by the stochastic approximation algorithm 
at every step. We parameterize the action probabilities of the constrained MDP 
using spherical coordinates. This parametrization is particularly well suited to the 
MDP problem and has superior convergence rate, see discussion in Sec 12.41 In 101 
13, a different parameterization is used which is similar to the generalized gradient 
approach of 11301 . 

Sec|6] gives a case study of how the results in the paper can be used to estimate 
a monotone scheduling policy that arises in transmission scheduling over fading 
wireless communication channels. We prove that the optimal scheduling policy 
is a randomized mixture of two monotone (threshold) policies - this proof is of 
independent interest as it generalizes the well known results of the classical paper 
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II 101 (see also 11261 ) to the case of a correlated Markov chain channel and with 
delay constraints. 



2. Problem Formulation and Discussion 

In this section we formulate the constrained MDP ©, (J3) as a stochastic optimiza- 
tion problem in terms of parameterized action probabilities. Then for convenience 
a summary of the key algorithms in this paper is given. Finally we briefly sum- 
marize how our approach differs from two other approaches in the literature. 

2.1. Spherically Parameterized Randomized Policies 

The randomized optimal policy for the above constrained MDP can be defined in 
terms of the action probabilities 8 parameterized by ip as: 



Here ip £ '5 is a finite dimensional vector which parameterizes the action proba- 
bilities 8, and is some suitably defined compact subset of the Euclidean space. 
The unconstrained problem with pure optimal policy is a degenerate case, where 
for each i G S, 6i a = 1 for some a £ Ui. 

The most obvious parameterization ip for 8 - which we will call canonical 
coordinates is to choose ip = 8. Thus ip = {V'i} = "i € S is the set of 

action probability vectors (ipi a \a G Ui) satisfying (J5]l. As discussed in Sec 12. 41 
this parameterization has several disadvantages. 

In this paper we use a more convenient spherical coordinate parameterization 
ip = a that automatically ensures the feasibility of 8(a) (that is, the constraints 
in (O hold) without imposing a hard constraint on a. Adapted to our MDP prob- 
lem, it reads as follows: Fix the control agent i E S. Suppose without loss of 
generality that Ui = {0, . . . , d(i)}. To each value 8i ai a <G Ui associate the values 
^ia = V&ia- Then (0 yields J2aeu tfa = 1' an d ^ia can be interpreted as the 
coordinates of a vector that lies on the surface of the unit sphere in K. d W +1 , where 
d(i) + 1 is the size of Ui (that is, number of actions). In spherical coordinates, the 
angles are a.i p ,p = 1, . . . d(i), and the radius is always of size unity. For d(i) > 1, 



P[u„ = a\X n =i) = 8 ia (iP), a£Ui, ieS 



(5) 
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the spherical coordinates parameterization a satisfies: 

!cos(a,,i) if a = 

cos(a ij(a+ i)) Ilfc=i sin(a ijfc ) 1 < a < d(i) - 1 ■ 
sin(a i!d (i))) IIfe=i 1 sin ( a i,k) a = d(i) 

(6) 

Note that f9» a (a.) = Af a is an analytic function of a, that is, infinitely differentiable 
in a. It is clear that under this a parameterization, the control variables a ip ,p € 
{1, . . . ,d(i)}do not need to satisfy any constraints in orderfor 9(a) to be feasible. 
Furthermore, since 9i a (a) = Xf a involves even powers of s'm(ai tP ) and cos(ai jP ), 
it suffices to consider oti P G a. where a denotes the compact set 

a = {a lp G [0, tt/2]; * G 5, p G {1, . . . , d(i)}} , (7) 

that is, for any a £ R d W, there is a unique a G a which yields the same value 
of 9(a). Let at° denote the interior of the set a. Finally, define the compact set 
a.^ C a° for user defined small parameter fjb > as 

« M = {a lp G [/i, tt/2 - p]; » G 5, p G {1, . . . , d(i)}} . (8) 

Note that a.^ is the set obtained by removing small intervals at and tt/2 from 
a. As discussed in detail after the statement of Proposition 1 1 . 1 1 in Sec|5] exclud- 
ing and tt/2 is necessary to prove convergence of the algorithms we propose - 
however, since fi can be chosen arbitrarily small, it is not important in the actual 
algorithmic implementation. 

2.2. Parameterized Constrained MDP Formulation 

We now formulate the above MDP problem as a stochastic optimization problem 
where the instantaneous random cost is independent of a but the expectation is 
with respect to a measure parameterized by a. Such a "parameterized integrator" 
formulation is common in gradient estimation, see 1123 1 , and will be subsequently 
used to derive our gradient estimators. Consider the augmented (homogeneous) 
Markov chain Z n — (X n , u n ) with state space Z = S x U and transition proba- 
bilities parameterized by a given by 

Pi,a,j,a' (a) = Ppf„+i = j, u n+ i = a' | X n = i,u„ = a) = 9 <j A > (a) (a) , 
i,jeS,a6 Hi, a' G Uj (9) 

It follows that for any a G a° (interior of set a), the chain {Z n } is ergodic, and 
it possesses a unique invariant probability measure 7Ti a (a); i £ S, a G U\ . Let 
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E w ( a ) denote expectation w.r.t measure 7r(a) parameterized by a. From (O we 
have Jx (u) = C(6(a)), where 

C(6(a)) =E nia) [c(Z)} =J2Y1 Ti,a(a)c(*,o). (10) 

We subsequently denote C(6(a)) as C(a), whenever it is convenient. The L 
constraints (O can be expressed as 

=E, (Q) [/?,(Z)]-7, =]T X) 7r <.«( a )A(*.<»)-7J <0, l = l,...,i. 

(ID 

Define S = (£?i , . . . ,Bi). Thus the optimization problem (HJi with constraints (0 
can be written as 

Problem SI: min C(a) (12) 
subject to: B { (a) < 0, l = l,...,L. (13) 

The constraints Bi(a) in ( fT3l will be called MDP constraints. As should be 
evident from this formulation, the optimal control problem ( fT2b . ( fT3b depends 
uniquely on the invariant distribution ir(a) of the chain, rather than the (unknown) 
transition probabilities Aij(u). 

It is important to note that in this paper, the expected cost C(a) and ex- 
pected constraints Bi(a) < are not known to the decision maker (since the 
transition probabilities are not known). Our aim is to devise a recursive (on- 
line) stochastic approximation algorithm to solve Problem SI without explicit 
knowledge of the transition probabilities A(a). Such an algorithm operates re- 
cursively on the observed system trajectory (Xq, . . . , X n , uq, . . . , to yield 
a sequence of estimates {a(n)} of the optimal solution. If the unknown dynam- 
ics A(a) of the MDP are constant, then the proposed algorithm ensures that the 
estimates a(n) approach the optimal solution. On the other hand, if the un- 
known underlying dynamics A(a) evolves slowly with time or jump changes 
infrequently, then the proposed algorithm will track the optimal trajectory in 
a sense to be made clear later. Stochastic approximation algorithms for track- 
ing slowly evolving parameters has been widely studied in the literature, see (5j 
1281 . Tracking parameters that jump changes infrequently has been analysed more 
recently in 1321. 

Assumption 1.1. The minima a* of Problem SI (that is, ( fT2b , ( fT3b ) are regular, 
that is, V a Bi(a*), I = 1, . . . ,L are linearly independent. Also, a* belongs to the 
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set of Kuhn Tucker points 

KT= | a* G a>* : 3 m > 0, 1 = 1,...,L such that V a C + V a Bp, = 0, B' \i = 

(14) 

where \i = (p,\ . . . ,hl)'- Moreover a* satisfies the second order sufficiency 
condition V Q 2 C(a*) + V a 2 B(a*)[i > (positive definite) on the subspace {y € 
K L : V a Bi{a*)y = 0} for all / : Bi(a*) =0,m> 0. 

2.3. User's Guide 

A summary of the key equations for implementing the learning based algorithm 
proposed in this paper in spherical coordinates for constrained MDP Problem S 1 
«H3, (0) is as follows: 

Input Parameters: Cost matrix (c(i, a)), constraint matrix (/3(i, a)), batch size 

N. 

Step 0. Initialize: Set n = 0, initialize a(n) <E a° and vector A(n) € 

Step 1. System Trajectory Observation: Observe MDP over batch /„ = {k G 
[niV, (n+l)iV— 1]} using randomized policy &(a(n)) of (O and compute estimate 
B{n) of the constraints, (cf. d50]l or d52j of Secl53Tl. 

Step 2. Gradient Estimation without explicit knowledge of parameters: Com- 
pute gradient estimates V a C(n), V a B(n) over the batch /„ using the WD phan- 
tom estimates G„ given in (|4TT >. 

Step 3. Update Policy 9(a(n)) using constrained stochastic gradient algo- 
rithm: Use a penalty function primal dual based stochastic approximation algo- 
rithm to update a as follows: 

(15) 
(16) 



a(n + l)= a(n) - e V Q C(n) + V a B(n) max 0, \(n) + pB(ri) 



X(n + 1) = max 



1 - - X(n),X(n) +eB(n) 
P 



The "penalization" p is a suitably large positive constant and max[-.-] above is 
taken element wise, see (l22l >. (f23]i. 

Step 4. Set n = n + 1 and go to Step 1 . 

Remark 1.1. 1. For large batch size N, the bias of the estimates B and V a B are 
0(1/ N) and the algorithm ( TOI ). ( TT6b is asymptotically optimal. However, for fast 

tMore precisely, «(0) needs to be initialized in a^, where a M C a" excludes a small /i size ball 
around the boundary of a, see (8) and Sec l5. 1 1 
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tracking of time-varying MDPs it is necessary to choose N small. In this case, the 
main source of bias in the estimation of a* using ( fT3T >, ( fTol l is the covariance of 
V a B with B. Several approaches to deal with this bias are discussed in Sec l5.3l 

2. Another alternative to ( fT6l ) is to update A via a multiplier (augmented La- 
grangian) algorithm (see Sec l3.2l ). A third alternative is to fix A. For sufficiently 
large p, a(n) will converge to a(oo) which is in a pre-specified ball around a local 
minimum a*. 

3. If the true parameters of the MDP jump change at infrequent intervals, then it- 
erate averaging 1 19 1 (as long as the minimal window of averaging is smaller than 
the jump change time) and adaptive step size algorithms can be implemented in 
the above stochastic approximation algorithms to improve efficiency and tracking 
capabilities. 

4. For the special case of two actions and a monotone policy (see Section[6]), then 
a(n) can be further parameterized and the dimension of the optimization Problem 
S 1 can be substantially reduced. 



2.4. Discussion of Other approaches in Literature 

Before presenting the details of the algorithm proposed in this paper, we briefly 
summarize two works in the literature that also use stochastic approximation 
methods to solve MDPs. It is well known II24II that Problem SI formulated in 
terms of the invariant measure it is the following linear program: 



miny^ ir ia c(i,a) (17) 
ies aeUi 

subject to ^2 X n ™Pl( i > a ) < 7i 

iSS a&Ai 

TTja = X] X! ^iaAijia), ^ = X ' < 7T W < 1, % <E S,aE Uj 



With 6* and 7r* denoting the optimal solutions of (fT2l) and dPTl ). respectively, it is 
straightforward to show that 

(18) 



The closest approach to our paper is that presented in HO. The MDP in (4j 
[3j is without constraints and uses the parameterization 

6ia{tp) = ^ T— , g R,i G S, o£ Ui. 
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This exponential parameterization satisfies 

dQiu _ J 0m(l - Qiu) u = a 
dipia \-9 iu e ia u^a. 

Using the chain rule of differentiation on the cost function C(6(ip)) (defined sim- 
ilarly to ( fTUb ) 

A^vm - E i-m (£) - *. [±<m - £ ^) , 

(19) 

which is identical to the Generalized Gradient of 11301 . In our report (T), we ex- 
plain why this formulation yields the appropriate descent directional derivative for 
canonical coordinates of in Sec l2.ll See also (30j and references therein. How- 
ever, gradient algorithms based on this parameterization can exhibit slow conver- 
gence, particularly when the optimal probability vector 9* is degenerate. When 
a component of d la is zero, ( fT9] l is zero and hence a gradient based algorithm re- 
mains at this point. Because the drift of the update is proportional to the size of 
the updated component, as a component approaches zero, the magnitude of future 
updates decreases (to prevent crossing outside the feasible set). This mechanism 
slows down convergence of the gradient algorithm using canonical coordinates 
if) = 6, particularly close to the optimal solution if this has components represent- 
ing pure strategies, as is often the case. We refer the reader to HI for numerical 
examples that demonstrate that the parameterization involving spherical coordi- 
nates has superior convergence properties compared to canonical coordinates. 

In addition, the approach for derivative estimation in H [5] is via the Score 
Function method, which usually suffers from unbounded variance for infinite hori- 
zon costs. To alleviate this problem, the authors use a forgetting factor that in- 
troduces a bias in the derivative estimation. Our derivative estimators are more 
efficient and consistent, with provably bounded variance over infinite horizon, in 
a° . In Sec l7.ll numerical examples show that the variance of the score function 
method is several orders of magnitude larger than that of the measured valued 
derivative estimator. 

The above methods all use a "simulation optimization" approach, where al- 
most sure convergence to the true optimal value can be shown under an appropri- 
ate choice of parameters of the algorithms. In particular, all stochastic approxi- 
mations involved in the above mentioned methodologies use decreasing step size. 
One of the motivations of the present work is to implement a stochastic approxi- 
mation procedure with constant step size in order for the controlled Markov chain 
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to be able to deal with tracking slowly varying external conditions, which result 
in slowly varying A(u). 

Remark 1.2. Our MDP setting assumes perfect observation of the process 
The paper |4l [3) considers a partially observed MDP (POMDP), but assumes 
that the observations Y„ of the process X n belong to a finite set. In that work 
they consider suboptimal strategies of the form 9i a = P{u n = a\Y n = i}. 
Such a policy is clearly not optimal for a POMDP since the optimal policy is 
a measurable function of the history (Y±, . . . , Yk, ui, ■ ■ . , Uk-i), which is sum- 
marized by a continuous-valued information state. Such suboptimal POMDP 
models are a special case of the problem considered here and our method 
can be applied in a straightforward manner. We refer the reader to ITT41 [T6l 
[HI for structural results in POMDPs. 

3. Ordinary Differential Equations for Solving Constrained MDPs 

As mentioned above, to find the optimal value a* defined in (TPfl i (or equivalently, 
8* defined in (fToTl), our plan is to use a stochastic approximation algorithm of 
the form ( fTSt , ( TToT l. A key result in stochastic approximation theory (averaging 
theory), see for example 11191 , states that under suitable regularity and stability 
conditions, the behavior of the stochastic approximation algorithm is captured by 
a deterministic ordinary differential equation (ODE) (or, more generally, inclu- 
sion) as the step size e goes to zero. Thus to design the stochastic approximation 
algorithms and give insight into their performance, we will first focus on design- 
ing ODEs whose solutions converge to the solution of the constrained MDP (12\ . 
(f]~3l>.In other words, we will construct suitable ODEs whose stable points will be 
Kuhn-Tucker points of the optimization problem ([Til l. (fi"3l i. 

Once such ordinary differential equations have been designed, the correspond- 
ing stochastic approximation algorithms follow naturally by replacing the gradient 
with the gradient estimate (which is computed from the sample path of the Markov 
chain), i.e, by replacing V a C, V a B, B in the deterministic algorithms presented 
below with the estimators V ' a C, V Q B, B. These estimates are computed using 
the parameter free gradient estimation algorithms given in Sec l4.2l The proofs 
of convergence of the resulting stochastic approximation algorithms are given in 
SecE] 

In this section, we present a primal dual and an augmented Lagrangian algo- 
rithm. Our technical report JT) also presents a primal algorithm based on gradient 
projection which requires higher computational complexity. 
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3.1. First-Order Primal Dual Algorithm for Constrained MDP 

A widely used deterministic optimization method (with extension to stochastic 
approximation in I1 181 pg.180]) for handling constraints is based on the Lagrange 
multipliers and uses a first-order primal dual algorithm 16, pg 446]. First, convert 
the inequality constraints (fT3l in Problem S 1 to equality constraints by introducing 
the variables z = {z x ,...,z L ) e K L , so that Bi(a) + zf = 0, I = l,...,L. 
Define the Lagrangian for the constrained MDP as 



A 



C(a, z, A) = C(a) + \ A, {B l (a) + zf) 



L 

£■ 

i=i 



(20) 



Here, A; £ R, I = 1, . . . , L are Lagrange multipliers for the constraint Bi(a) + 
zf = 0. In order to converge, a primal dual algorithm requires the Lagrangian 
to be locally convex at the the optimum, that is, Hessian to be positive definite 
at the optimum (which is much more restrictive than the second order sufficiency 
condition of Assumption 1 in Sec. 12.2b . Numerical examples show that this posi- 
tive definite condition on the Hessian, which Luenberger 11201 pp.397] terms "local 
convexity", seldom holds in the MDP case. We can "convexify" Problem SI by 
adding a penalty term to the objective function (12\ . The resulting problem is: 



i=i 



Here p denotes a large positive constant. The optimum of the 
pg.429] is identical to that of ( fT2i i. (fT3l . Define the augmented 



subject to dot 
above problem 
Lagrangian, 



L 

C p (a, z, A) = C(a) + £ A,(B,(a) + zf) + £ £ (B,(a) + zf) 2 . (21) 
i=i 



i=i 



Note that the original Lagrangian may not be convex near the solution (and hence 
the primal dual algorithm does not work), for sufficiently large p, the last term 
in C p "convexifies" the However, for sufficiently large p, |6) shows that the aug- 
mented Lagrangian is locally convex. After some further calculations detailed in 
10 pg.396 and 397], the primal dual algorithm operating on C p (a(n), z(n), A(n)) 
reads: 

a e (n + 1) = a e (n) - e ( V Q C(a e (n)) + V Q B(a e (n)) max 
A e (n + 1) = max 



1- - A £ (n),A e (n) +eB(a e (n)) 

P. 



0,A e (n) +pB(a e (n)) 
(22) 
(23) 
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where e > denotes the step size and the notation z = max[i, y] for any 
two equal dimensional vectors x, y denotes the vector z with components z, = 

vmx.[xi,yi]. 

Lemma 1. Under Assumption 1, for sufficiently large p > 0, there exists 1 > 0, 
such that for all e G (0, e], the sequence {a e (n), A £ (n)} generated by the primal 
dual algorithm ( 1221 ) is attracted to a local KT pair (a*, A*) of Problem SI. 



Proof. Since C p is locally convex for sufficiently large p > 1201 . the proof 
straightforwardly follows from Proposition 4.4.2 in (6). □ 

Let T G R + denote a fixed constant and t G [0, T] denote continuous time. 
Define the piecewise constant interpolated continuous-time process 

a e (t)=a € (n) t € [ne, (n + l)e) (24) 
A e (t) = A e (n) t G [ne, (n + l)e). (25) 

Lemma [T] implies that {A e (n)} lies in a compact set A for all n. Recall from 
(HJ, that a e (n) G a' 1 where a M is compact. Then the following result follows 
directly from the above lemma and 1 1 8.1 where the convergence of the stochastic 
version is proved. 

Theorem 1. Let {a(t),X(t)} satisfy the ODEs 

^-a(t) = -V a C(a(t)) - V a B(a(t)) max 
at 



0,A(t) + P B{a{t)) 



a(0) G 



* \A,(*)/p ifXtW+pBtMt)) <0, 

(26) 

r/zen, under Assumption 1, for sufficiently large p (see LentmaUj, the interpo- 
lated process {a e (£), A e (t)} defined in A24\l . ( 1251 ) converges uniformly as e — > to 
f/ze process {a(t), X(t)}, that is, 

lim sup |a £ (i) = 0, lim sup \X e {t) - X(t)\ = 0, (27) 

0<t<T £ i° 0<t<T 

T/ie attraction point ofi26\l is the local KT pair (a* , A* ) of Problem SI. 

3.2. Augmented Lagrangian (Multiplier) Algorithms for Constrained 
MDP 

We outline two augmented Lagrangian (multiplier) deterministic algorithms for 
optimizing the constrained MDP assuming full knowledge of the objective C(a), 
constraints B(a) and their derivatives. 
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1. Inexact Primal Minimization Multiplier Algorithm: The augmented La- 
grangian approach (also known as a multiplier method) consists of the following 
coupled ODE and difference equation: 

df = - V tt C(a (n+1) (t)) - V Q B(a (Tl+1) (t)) max |o, A(n) + (t) 



A z (n + 1) = max [o, A;(n) + p^(a (Tl+1) (oo)) 



(28) 

i = l,...,L, (29) 



where a' Tl+1 - ) (oo) denotes the stable point of the ODE d28l . Iteration (l29b is a first 
order update for the multiplier, while d28T l represents an ODE which is attracted to 
the minimum of the augmented Lagrangian C p . The max in ( 1291 arises in dealing 
with the inequality constraints, see (6] pp.396]. ]6] Proposition 4.2.3] shows that 
if (a(°)(0), Ao(0)) lies in the domain of attraction of a local KT pair (a*, A*), 
then d28l i. ( 1291 converges to this KT pair. A practical alternative to the above exact 
primal minimization is first order inexact minimization of the primal. The iterative 
version of the algorithm reads |6 , pg.406]: At time n + 1 set (n + 1) = a(n). 
Then run j = 0, . . . , J — 1 iterations of the following gradient minimization of 
the primal 

a (i+1) (n + 1) = (n + 1) - e (V Q C(a w (n)) + V Q B(a (j) (n)) max 0, A(n) + pB{a^ (n j) V 

(30) 

a(n + 1) = a'' 7 ' (n + 1) followed by a first order multiplier step 

\i(n + l) = max [0, X t (n) + pBi(a{n + I))] , l = l,...,L. (31) 

Iteration (l30T > represents a first order fixed step size inexact minimization of the 
augmented Lagrangian C p {inexact because d30b is terminated after a finite num- 
ber of steps J). It is shown, see and references therein, that as long as the 
inexact minimization of the primal is done such that the error tolerances are de- 
creasing with ?i but summable, then the algorithm converges to a Kuhn Tucker 
point. 

2. Fixed Multiplier: A trivial case of the multiplier algorithm is to fix A(n) = A 
for all time n and only update a according to < f30l > with 1 = 1 iteration at each time 
instant. This is clearly equivalent to the primal update (1221 with fixed A(n) = A. 
From Theorem[U the interpolated trajectory of this algorithm converges uniformly 
as e — > to the trajectory of the ODE 

= -V a C(a(t)) - V a B(a{t)) max [0, A + pB(a(t))] , a(0) £ . 

(32) 
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The following result in j6 1 shows that the attraction point of this ODE is close to 
a* for sufficiently large p, resulting in a near optimal solution. First convert the 
L inequality constraints to equality constraints as outlined in Sec 13. II Let a* , X* 
denote the corresponding KT pair. 

Result 1. Proposition 4.2.3]. Let p > be scalar such that V a 2 £ p (a*, A*) > 
0. Then there exist positive scalars 5 and K such that for (A, p) € D £ M L+1 
defined by 



the attraction point a x - p of the ODE (l32l l is unique. Moreover, \\a X:P — a*\\ < 



4. Gradient Estimation Algorithms for Constrained MDP 

In the previous section, we presented deterministic algorithms for optimizing the 
constrained MDP Problem SI that assume complete knowledge of the gradient 
of the expected cost and constraints. This section deals with how to estimate 
the gradient of the cost and constraint by simulation. As mentioned earlier, this 
gradient estimation step is a key step in the stochastic approximation algorithms 
to optimize the constrained MDP. 

Throughout this section we focus on the process with a fixed value of the con- 
trol parameter a. We use E Q to denote expectation w.r.t the underlying probability 
measure of {Z n , n = 1, 2, . . .}. We will also use E; to denote expectation w.r.t. to 
the distribution of the random action, given that the state is X n = i. Finally, Z n 
will be refereed to as the "nominal process." 

4.1. Infinite Horizon Measure Valued Derivative Estimation 

Although we are interested in estimating the gradient given short data lengths (so 
as to deal with time varying transition probabilities), to fix ideas, we first focus on 
the infinite horizon case, that is, we build gradient estimators of 



Here F indicates a function of the state-action pair Z = (i, u), such as the cost 
c or the constraint values bi, and iri, u )a) is the stationary measure of the chain 
{Z n } under fixed parameter value a. The resulting estimators estimators V a C 
and V Q B are parameter free (learning) gradient estimators. 



D = {(X,p):\\X-X*\\<6p, p>p} 



K(\\X-X*\\)/p. 




(33) 
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In the case of spherical coordinates, it follows from the relationship I© that 
the action Uk+i (given ik+x = i) has a distribution 



Yi 



Yd(i)-i 



w.p. cos 2 (aii) 
Yi w.p. sin 2 (aii) 

1 w.p. cos 2 (a l2 ) 
Y 2 w.p. sin 2 (a i2 ) 

d(i) - 1 w.p. cos 2 (a M( , 4) ) 
d(i) w.p. sin 2 (aj i(j (i)). 



Let Y^i) = Th e va l ue of on p , i € S, p € {1, 2, . . . d(i)} does not affect the 
distribution of u^+i if ife+i 7^ i, therefore the gradient of the one-step transition 
expectation is non null only when iu+i = i, in which case we have 



-E{f(i,u k+1 )} = — — E 



da ip da>ir 



f(i,p- 1) cos 2 (a ip ) + f(i, Y p ) sin 2 (a lp ) 



1 sin 2 (a ifc ) 



fe=i 



p-i 

= -2sin(a 4p )cos(aip) JJ sin 2 (a lfc )E[/(i,p - 1) - f{i,Y p ) 
fc=i 

because the terms /(£, fc), fc < p — 1 have weights which are independent of a^ p . 
The random variable Yp is called the "phantom action" and it has a distribution 
concentrated on {p, . . . , d(i)} corresponding to 

F(Y p = a) = p _/ ig(a) , o>p, pe {1,2,... d(i)}. (35) 

J sin 2 (a lm ) 

m— 1 

Note that by construction, for p = d(i) the random variable Y d ^ = d(i) is de- 
generate. 

The random variable Y p is called the "phantom action" and it has a distribution 
concentrated on {p, . . . , d(i)} corresponding to 

P(y p = o) = p _/ ia(a) , a>p, pe {1,2,... d(i)}. (36) 



J sin 2 (a lm ) 



m— 1 



Note that by construction, for p = d(i) the random variable Y^i) = d(i) is de- 
generate. 
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The above analysis for the one-step transition expectation is an example of 
what is called a "weak derivative" 11231 . To obtain the gradient of an infinite hori- 
zon expectation d33l l, a measure-valued approach for weak derivatives is used in 
Q~). In the proof of Theorem|2] we will use the approach of the perturbation real- 
ization factors. The equivalence between the two approaches has been shown in 

ED. 

First introduce the following notation. Let {uk} denote a sequence of iid 
random variables (independent of $ n ) as follows: 

f Y P> p<d(i) 
Uk = \ (37) 
{d(i)l {uk=d{l) _ 1} + (d(i) - l)l {uk=d(l)} p = d(i) 

we now provide the basis for the interpretation of Uk- Forp — 1 < d(i) F(uk+i = 
p — 1 | Xk+\ = i) = cos 2 (ctip) Ilfe=i sm2 ( a ifc)> so mat we can re write the 
derivative expression as: 

d 

E[f(i, Uk+i)] = -2 tan(a lp )P(u fc+ i = p-1 \ X k +i = u k+1 )-f(i, u k +i)] \ u k+1 = p-l], 



da ip 



which is the basis for using an 'on-line" estimator, as we will shortly explain in 
detail: when the state is {X^+i , Ufe+i = (i,p — 1), we calculate a difference 
between the observed trajectory and that of the phantom process, where the only 
difference is that the phantom action is now Uk- Because the case p = d(i) has a 
degenerate distribution for Y p , the estimation can be made both when Uk+i = d(i) 
(in which case the phantom action is d(i) — 1), or when it is d(i) — 1 (in which 
case the phantom action is d(i)). The complete details of this reasoning appears 

in ID. 

Define the hitting time 

v(k) = min{n > : Z k +„ = (i,u k )}. (38) 

Theorem 2. Consider the MDP {Z n } governed by (0) and (0 with a € a M . Fix 
(i,p), a = p — 1. Let T denote any estimator that converges as N — > oo to C(a) 
a.s. (e.g., f = 2 n =i c (Z>n)) and K ip (a, k) defined as: 



K ip (a,k) 



- t&n(a ip ) Si^ p (k) ifp=l,..., d(i) - 1 

- cos(a l . d ( l) ) sin(a ijd(i ))[r5j !d (j)(fc) - 8 iA{i)+l (k)\ ifp = d(i) 

(39) 

Here S ip (k) = l{ Zk ={i, P -i)y 

Then for fixed a £ a°, a parameter free consistent estimator for the gradient 
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in f|33| ) is given by: 

2 N I k+u(k) \ 

Gn(i,p) = j^^K ip (a,k) \[c(i,u k ) - c(i,u k )]+ ^ c(Z n ) - v(k) f J , p = 1, . . . , d(i). 

k=l \ n=k+l J 

(40) 



The proof of the above theorem is in the appendix. The algorithm computes 
the differences between the nominal (observed) process trajectory and the "par- 
allel" ones that start at time k with action u k . Such parallel processes are called 
"phantom" processes, and we call our algorithm the WD phantom estimator. 

4.2. Fast WD for Tracking Time-Varying MDPs 

In Theorem |2 consistency of the WD phantom estimator for large sample size 
N was established. However, for adaptive control of constrained MDPs with 
time varying transition probabilities, it is necessary to implement the estimation 
over small batch sizes of the observed system trajectory so that the iterates of the 
stochastic gradient algorithm are performed more frequently to track the optimal 
time varying a* . The aim of this section is to present an implementation of the 
WD phantom estimators over short batch sizes N and to show that the resulting 
gradient estimate is still consistent. We call these "fast WD phantoms". 

The implementation of the fast WD phantoms over short batch sizes proceeds 
as follows. Suppose that the gradient V Q C(a) is to be estimated using the ob- 
served MDP trajectory over the ?ith batch /„ = {nN + 1, . . . , (n + 1)-/V}. As 
in Theorem 12 let T n denote an estimator of C(a) using the observed trajectory 
of the MDP in I n . The fast WD phantom estimators for the components (i,p), 
i G S, p € {1, 2, . . . d(i)}, of the gradient are (compare with d40li): 



G n (i,p) 



r (n+l)N min{(fc+i/(fc)),(n+l)iV} 



k—nN-\-\ 



K ip (a,k) c(i,u k )~c{i,u k )+ ^ c(Zj)-v(k)T n V n (k)] 



min{(fc+y(fc)),(n+l)AT} 



(41) 



+ J2 K w(^ k )[ Yl c{Zj)-v{k)f n V n {k) 

keC(nN) ^ j =n N+l 

where T> n (k) = l{k+v(k)ei n } (phantom k dies in /„), and C(j) = {k < 
j : v(k) + k > j} is the list of living phantoms at stage j. A similar estima- 
tor holds for the gradient of the constraints. 

The interpretation of d4TT > is as follows. At each step k, the state Z k = (i, a) 
is observed and a new phantom system (labelled by k) is started, generating the 
phantom decision ii k as described above. The fc-th phantom system "dies" at the 
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hitting time j = k + v{k), otherwise it is contained in the set of "living" phantoms 
C(j). In a computer program, this corresponds to a list. This phantom will be used 
to estimate the partial derivative of all the functions (cost and constraints) with 
respect to a;, a +i if a < d(i) — 1. If a = d(i) — 1 or a = d(i), the corresponding 
phantom system contributes (with opposite signs) to the estimation of the gradient 
w.r.t. diMi) - The difference in costs inside the brackets in (fiTT i contains the initial 
contribution of a phantom system. Afterwards, while a phantom system k is alive, 
it contributes to (HTt the term c(Zj) at each step, and when it dies (T> n (k) = 1) 
it contributes the term v(k)T n (if death occurs within the interval /„). The above 
equation takes only observations of the trajectory within the current interval, thus 
a final term appears considering the contributions of all the living phantoms at 
the start of the interval, because it is possible that phantom systems may survive 
several estimation intervals. For complete details on the implementation program 
code and other variations please see 11171 . 

Theorem 3. Assume that for any a € ot°, M^/ a \T n = C(a). Then the bias of the 

fast WD phantom gradient estimator G n (i,p) of A41\) per batch of size N is given 
by (with Ki p (a, k) defined in A39i ) 

^2 ^ 1 

E n{a) [G n (i,p)]-W a C(a) = — K ip (a,k)Cov n{a) (v(k),T n )+o(—). 

k: k+v(k)el n 

(42) 

Proof. Let G n (i,p) denote the gradient estimate (HTb when is replaced by 

C(a). We first show that the gradient estimate G n (i,p) is unbiased under the 

invariant measure w(a), that is, E ff ( a )[G n (z,p)] = V a C(a). 

Note that under ir(a), consecutive estimators have the same distribution (al- 
though they are not independent), because the invariant distribution of the number 
of living phantoms at the start of the interval is independent of n. From the ergod- 

icity of the underlying MDP, E 7r(ct) [G n (i,p)] = lim„^ 00 (l/n) X)m=o G vi(hP)- 
Because the estimation by batches considers breaking up the partial sums of the 
estimation, then the difference: 

c(i,a) — c(i,Uk) + /J c (Zj) — v(k)C(a) 
i=(fe+i) 

tends to zero in absolute value, a.s. Using Theorem [2] the term in the 

above equation converges a.s. to V a C(a) as n — > oo, for any fixed value of the 
batch size N. This establishes the claim that the gradient estimate is unbiased. 



1 (n_1) -xo 2 nN 

- /Z G m (i,p)-—J2 K *p( a > k ) 



m=0 



fe=l 
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Next, using the above expression consider G n 
consecutive estimators are additive, we obtain: 



G n . Then using the fact that the 




m— m— L k—1 

The error term e(nN) comes from the phantom systems k: k + v(k) < nN, and 
their contribution tends to zero a.s. as n — > oo. Under the invariant measure, we 
have: 



(v(k)(t m - C(a))l {fc+I/ ( fc)G j m }j = Cov OT(a )(^(fc)l {fc+y ( fe)eJm} ,f m )+C'(a)E 7r ( a )(^(A;)l {fc+I/ ( fc ) G j m }). 

□ 

4.3. Computational Cost and Memory Requirement 

Consider the fast WD phantom algorithm. Let a E a^. We will bound stochasti- 
cally the number of living phantoms in terms of Binomial random variables. 

Consider the process {Z n } in stationary operation, and call C n if(i t a) the 
number of living phantoms in d4Tb that are waiting to hit state (i,a). All these 
phantoms were created at some earlier time instant when the chain hit the state 
Zk = (i,p) and the phantom decision was Uk = a- Clearly, the maximum num- 
ber of phantoms in C n isr(i, a) satisfies: 



where t(i,a) = max(j < nN: Xj = l,uj = a), because if the state (i, a) 
is visited at time j, at that time all living phantoms that were in Cj-i(i,a) die 
and Cj(i,a) is empty. Call r(i,a) the return time to state (i,a) and let n(i) 
be the number of visits to state i (Xk = i) within two consecutive visits to 
state (i, a). Then the number of phantom systems in £ n jy(a) is bounded by a 
Binomial(p(a), n(i)), where 



nN 




a>p k=t(i,a) 



EL=i sin2 ( a ™)' 



according to the creation of the phantom systems. Clearly considering the max- 
imum value of all such probabilities, we can bound the number of phantoms on 
each list for every value of a € a' 1 . 



In (flTl i. the estimator G n (i,p) is composed of bounded quantities (the state 
space is finite) plus a contribution of N terms of the order of v{k) each, plus a 
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contribution which is proportional to the random variable: 



n(i) 



r(2,a) 



E E "(*)< 



££"(*)< 



E E "(*) 



a>P fee£„ N (i,a) 



a>p i—1 



a>p k=l 



Note that, as in the proof of Theorem[2] v(k) < max(r(i, a) : a > p) = r a.s. 
Because a M is a compact set with ergodic states, the return times are all finite 
a.s.. and E Q [r(i, a)] = l/ni a - Therefore, boundedness of the m-th moment of 

G n (i,p) now follows from boundedness of the 2m-th moment of r. 

5. Stochastic Gradient Algorithms for Constrained MDP 

In this section we present the stochastic gradient algorithms that use the param- 
eter free gradient estimators (fast WD phantoms) of Sec l4.2l to optimize the con- 
strained MDP given in Problem S 1 (([T2l. (foil). Also weak convergence proofs of 
these algorithms are presented. The stochastic algorithms presented are stochastic 
versions of the two deterministic algorithms of Sec l3.ll and Sec l3.2l Using the 
simulation-based fast WD phantom gradient estimator (|4H with local sample av- 
erages for the estimation of the constraint functions may lead to a noticeable bias, 
for small batch size N thus making the algorithm suboptimal. 

For notational convenience we consider equality constraints here (as men- 
tioned in Sec B.ll the inequality constraints can be handled with minor modifi- 
cations) so that the constrained MDP problem (fT2l . il3[ reads 



A control "agent" is associated with each of the possible visited states i £ S. 
The control parameter for this agent is the vector (cti P ,p G {1, . . . , d(i)}), plus 
an agent for the (artificial control) variable representing the Lagrange multiplier 
A. The scheme works by observing the process over a batch size N during which 
the value of the control parameter does not change. Over this batch, the constraint 
is estimated as B(n), see Sec l5.3l and the gradients are estimated as V Q C'(n), 
V a B{n) using (|4T). We assume that E ff ( Q ) [B(n)\ = B{a). Let 



h (a) = E 7r(Q) [V Q C(n)], hi(a) = E ff(Q) [V a Bi(n)], 1 = 1, ...,L 



min C(a), s.t. Bi(a) = 0, l = l,...,L 



be the invariant averages of the batch estimation (refer to Theorem|3}. 
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5.1. Stochastic First-Order Primal Dual Algorithm 

Consider the stochastic approximation where the control parameter (a, A) is up- 
dated as (c.f. d22]), (|23]l) 

a e {n + 1) = a e {n) - e ( Vj5(n) + ( A * M + pBi{n)\ V^B/(n) + Z M (n) ] , a e (0) £ at 



(43) 

X'(n + l) = X £ (n) + eB(n) (44) 

where the gradient estimators are given by the fast WD phantoms in ( |4TT > and 
Z^in) is a projection that ensures that {a e (n)} £ a* 1 . The above truncation is a 
mathematical artifice to prove convergence. In practical implementation, that is, 
when e > 0, truncation is not important since one can choose [i << e, e.g., close 
to the numerical resolution of the computer, see remark below. 

Proposition 1.1. For a e (rt), \ e (n) given by (O, (O, define the interpolated 
process {a e (t), A e (i)} as in K24\ . f f23T ). Then as e — >• 0, {a e (i), A e (i)} converges 
in distribution to the solution of the ODE (c.f. ( 1261 )) 



d , . 



M«(<)] + + pBi[a{t)\) hi[a{t)\ + n[a(t)} 

i=i 



, a(0) £ a 
(45) 



^A(t)=B[a(t)], 

where the notation [-j^ re/ers fo f/ie truncated ODE onto a M ana" f/ie added drift 
is defined as 

L 

K(a)=p lim VCov j(n )[£i(ra),V^S|W]. (46) 

Remark 1.3. A weak convergence proof of the stochastic gradient algorithm d43l . 
d44t requires uniform integrability of the gradient estimates. Without the above 
truncation, if cti p = it/ 2 or a 2 : p = 0, (equivalently one or more action probabilities 
Oiu = 0), then the hitting time z^(fc) in (l38l l of the phantom system k with initial 
state (i, u) is not uniformly bounded and the gradient estimator (HTI ) is not well 
defined. So in the weak convergence proof below we place a /i size ball around the 
boundary of ct (recall cc M above is defined as a minus this ball) and the estimates 
a e (n) are truncated to a M . However, this truncation is very different to standard 
truncations in the stochastic approximation literature: 
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1. Recall from Sec l2.1| that the boundary of the set a is a fictitious boundary and 
the action probabilities 9(a) are symmetric about this fictitious boundary (since 
they are functions of sin 2 (a) and cos 2 (a)). For example, suppose that a\ p (n) < 
ir/2 — fi and that d43T > generates the estimate a\ p (n + 1) = n/2 + fi + K for 
some small constant K > 0. Then truncation is not required since by symmetry 
a t P ( n + 1) = "72 — (/-' + K) G a>1 - Thus in practical implementation, that 
is, when e > 0, truncation is not important since we can choose fj, << e. The 
probability that an update lies precisely in a ball of radius /i is negligibly small - 
since if the estimate overshoots or undershoots this ball, it is automatically in a''. 

2. Suppose that the untruncated version of the ODE (l45l l has a stable point on the 
boundary of a, e.g., a pure policy. Then clearly, the truncated ODE will have a 
stable point within O(fi) of the stable point of the untruncated ODE. Hence the 
truncation is very different to standard ODE truncations such as i 



Proof. We first show that the term in parenthesis on the RHS of (T43b is uniformly 
integrable. Note that all the terms in the gradient estimate (T4Tb apart from £(•) and 
v(-) are uniformly bounded for a G ot^ since the MDP is finite state and the batch 
size N is fixed and finite. Hence a sufficient condition for uniform integrability is 
to show that J2kec(nN) v $) in (|4TT > has finite variance. 

Fix (i,p) for the estimator in (t4TT >. We focus on the phantoms that have an 
initial decision Uk — a for a fixed a - and we call these a-phantoms. By definition, 
all a-phantoms die simultaneously at time n\ when the process Z ni = (i, a). Let 
no denote the previous time instant at which {Z n } visited (i,a), that is, Z„ = 
(i,a). The longest hitting time v(k) of all these phantoms is bounded by the 
time between successive returns n\ — no- Consider now the number of living 
a-phantoms in C(n) at any time n E [no, n\]. Each of these a-phantoms must 
have been created when the process hits the state (i,p) and the phantom decision 
chosen is a - which happens with probability ^ 9ia g . ■ Therefore an almost sure 
upper bound for the cardinality of C(n) is tlx — n . Note that n\ — n is the return 
time of a ergodic MDP on a finite state and therefore has all moments bounded. 

Having established uniform integrability of the updates, the result follows by 
direct application of Theorem 5.2.1 in 111 81 . The continuity of the invariant expec- 
tations follows from the fact that the transition kernel of Z n is analytic in a. To 
characterize the drift functions, use: 



Bi (n)V a B t (n) = E, (a) [B t (n)} E <a) [V a B t (n)]+Cov w(a) [Bj (n), V a Bi (n)] 



which establishes the result. 



□ 
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5.2. Stochastic Augmented Lagrangian Multiplier Algorithm 

Consider the following stochastic approximation version of the multiplier algo- 
rithm (fJTJ: 

a e (n+l) = a £ (n)-e (\UC{n) + ^(Af(n) + pB{n)) VJ3i{n) + Z*(n)j , a e (0) 6 a 

(47) 

where Z^(n) is a projection that ensures that {a e (n)} E a} L and {X e (n),e > 
0, n E N} is any tight sequence. A trivial example is when A e (n) is a bounded 
constant (a.s.). 

The following result regarding the weak convergence of (|47| | is proved in the 
appendix. 

Proposition 1.2. Assume that {A e (n), e>0,h£ N} is tight. Define the interpo- 
lated process of (t) of M7\ as in A24\l . Then as e — > 0, the interpolated process 
a e {t) converges in distribution to the solution of the ODE: 



da(t) 



(It 



h [a{t)] + +pBi[a(t)]) hi[a{t)} + n[a(t)} 

(48) 

where Xi is an accumulation point of the sequence {A(e), e > 0} of (convergent) 



, a(0) e a", 



Cesaro sums: 

N 



N-too N 

n=l 



and k(o) is defined in 



Remark 1.4. If the bias in hi(a) and k(o) are negligible, then under no trun- 
cation, the ODE ( l48l l reduces to ( f32l > - which is the ODE for the deterministic 
fixed multiplier algorithm. ResultQ]of Sec l3.2| implies that the estimates converge 
weakly to a near optimal point, provided that the pair (A, p) is well chosen. The 
bias in hi(a) and n(a) is of order 0(1/N). 

Consider the following update of the multiplier A in (l47b . Define 1 = [l/e\ 
and consider the recursion 

1 « x 

A(n + l) = A(n)+B(n/I)l { „ eN |, where B{n/1) = - £ B(J), 

j=f>-l)X+l 

(49) 
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together with ( |47| i. Thus the multiplier is updated once every 1 time points. If 
the bias in hi(a) and re (a) is negligible, then as e — > 0, the algorithm (|47|i- 
( |49l l converges weakly to the deterministic system ( |28l l, < f29b which is the exact 
multiplier algorithm. As mentioned in Sec l3.2l this in turn converges to a local 
KT point. In a practical implementation, one would choose I as a large positive 
integer. In our numerical examples, see (JJ, even a choice of I = 10 resulted in 
convergence to a KT point. 

5.3. Tradeoff between Bias and Tracking Ability 

The three sources of bias in the stochastic gradient algorithm (I43t . t44i are the 
bias in the estimates V a B, V ' a C and B. A quick mathematical artifice for elim- 
inating the bias is to use batch sizes N(e) — > oo as e — > 0. Then the ODE (l45l l 
becomes identical to (l26l i. In the numerical examples of (TJ we chose A = 1000 
- for finite N the bias is 0(1/N). Although choosing N(e) — > oo is theoretically 
appealing, it is of no practical use since the stochastic gradient algorithm will not 
respond quickly to changes in the optimal policy caused by time variations in the 
parameters of the MDP. In our conference paper 11171 we use batch sizes N = 5, 10 
to update the parameter a(n) frequently, and indeed the stochastic approximation 
algorithm can be implemented even for N = 1. The bias in the gradient estima- 
tor of Theorem [3] can be controlled using averaging of the estimation of the cost 
function, or other smoothing statistical techniques. However, what we really are 
interested in is the resulting bias of the stable point of the limiting ODE. The re- 
sults of extensive numerical studies indicate that the bias in hi (a), I = 0, 1, . . . , L, 
has negligible effect on the behaviour of the stochastic gradient algorithm even for 
small batch sizes of N = 5. For example, we performed comparisons using the 
local sample average T n = J2kei n C i^k)/N and then using the actual theoretical 

value r„ = C{d) in G n of ( |4TT > as well as for the constraints with no remarkable 
difference in the stochastic gradient algorithm. 

The main source of bias for small batch sizes is that introduced by re(a), in 
Propositions 11.11 and l 1 .21 Using the local sample average over the n-th batch 



yields a noticeable asymptotic bias re(a). A better alternative is to use the Ce- 

saro sum Bf(n) = — Bi{k). Since Bf(n) — > Bi(a) a.s. for all a G a° 

as n — > oo, this Cesaro sum estimator would correct the asymptotic bias of 
the stochastic gradient algorithm. However, running averages do not respond to 
changes in the underlying parameters (e.g. transition probabilities) of the MDP 




1 



(50) 
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since they are decreasing step size algorithms. Hence they cannot be used for 
tracking time varying optimal policies. To handle this tracking case, we use in 
II 171 an exponential smoothing 

a e (n+ 1) = a c (n) + e (vjj(n) (xt(n) + pBf(njj V^B;(n)") (51) 

^ i=i ' 

\\{n+ 1) = A f (n) + £B ; e (n), Bf(n + 1) = Bf(n) + S (bi(ti) - Bf(ri)j , l = l,...,L, 

(52) 

where S > and Bi(n) is the local sample average in 1501 . Using a two time 
scale stochastic approximation argument it can be shown that if e/S —> 0, e.g., 
if S = ^/e, and e — > 0, then the asymptotic limit points of the corresponding 
ODE are unbiased. In practical implementation, for non zero 6, the estimates are 
biased. While the asymptotic bias can be controlled, the exponential smoothing 
delays the reaction time of the stochastic gradient algorithm since a faster time 
scale has been introduced, as illustrated in Section|T2] 

6. Case Study: Monotone Policies for Packet Transmission Scheduling 
over Correlated Wireless Fading Channels 

In this section we consider a special case of Problem S 1 that arises in transmis- 
sion scheduling in wireless telecommunication systems. The constrained MDP 
we consider, models a transmission scheduling problem in a wireless telecommu- 
nication network. The action set is U — {0, 1} corresponding to transmit and do 
not transmit, respectively. By using a Lagrangian formulation for dynamic pro- 
gramming, we show that the optimal policy is a randomized mixture between two 
deterministic monotone (threshold) policies; such a policy is a two-step staircase 
function as plotted in Fig. [I] So the weak derivative based stochastic approxima- 
tion algorithms presented above can be used to estimate this structured optimal 
policy. Because of the threshold structure of the optimal policy, the algorithm 
implementation is very efficient. 

Consider the following transmission scheduling problem over a correlated fad- 
ing wireless channel. At each time slot, a user has to decide whether to transmit 
a packet unless the packet storage buffer is empty. The objective is to minimize 
the infinite horizon average transmission cost subject to a constraint on the av- 
erage delay penalty cost. As in Oil 1341 . we model the correlated fading wire- 
less channel by a finite state Markov chain (FSMC). That is, we assume the 
channel state evolves according to a FSMC, and the channel state realization is 
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known at every time slot. At time n = 0, 1, . . ., the system state is the 3-tuple 

X n = [X>,XZ,Xv],wbcn: 

(i) X b e B = {0, 1, . . .} is the buffer occupancy state 

(ii) X° is the state of the correlated wireless communication channel. Assume 
X c n e C = {Ci,C2, ■ ■ ■ ,Ck}, where C is the finite channel state space and Cj cor- 
responds to a better channel state than Cj for all i > j; X% evolves as a Markov 
chain with the transition probabilities A C (X^ +1 = Cj\X^ = d) = a^. 

(iii) X% is the number of new packets arriving at the buffer. For simplicity, assume 
an i.i.d binary packet arrival process, that is at any time n X% € y = {0, 1}, 
which denotes either the arrival of no packet or one packet, with the probability 
mass function P(X% = 1) = 5 and P{Xl = 0) = 1 - 6. 

The system state space is then the countable set S = B x C x y 

Let the action sets be U = {0, 1} for every buffer occupancy state X b > 

and Uq = {0} for the buffer occupancy state X b = 0, where and 1 correspond 

to the action of not transmitting and transmitting respectively. 

We now define the costs and constraints of constrained MDP in Problem S 1 : 

• The transmission cost function c(-, ■) : C x U — > R+ is a function of 
the channel state. Assume that when a transmission is not attempted, no 
transmission cost is incurred, that is c(-, 0) = 0. 

• The constraint is specified by a delay penalty cost /?(•,•) :CxW-> R + , 
which is applicable only for buffer occupancy state x b > 0. Assume that 
when a transmission is attempted, there is no delay penalty cost, that is 
/?(-,!) =0. 

For channel utilization enhancement, it is assumed that c(-,u) is decreas- 
ing and /?(-, u) is increasing in the channel state, that is the transmission 
cost is lower and the delay penalty cost is higher for better channel states. 

If the transmission of a packet over the channel is attempted, that is action 
u = 1 is selected, a packet will be successfully received (and hence removed from 
the buffer) with probability given by the function / : C — > [0,1]. Here, /(•) is 
a user-defined increasing function, that is, a higher (better) channel state has a 
higher success probability. 

The transition probabilities of the constrained MDP are then given by 

F(X n+1 \X n , u = 0) = P(^ +1 |X^)P(^ +1 )I(X^ +1 = X b n + XI) 

p(x n+1 \x n , u = i) = p(x* +1 \xz)nxy n+1 )f(x c n )i(x b n+1 = x b n + xi - 1) 
+ p(x; l+1 |x r c jp(^ +1 )(i - f{x c n ))i{x b l+l = x b + x%), 



where I( ) is the indicator function. The corresponding constrained MDP is then 



October 25, 2011 0:37 World Scientific Review Volume - 9in x 6in MOR06 



2S 



Vikram Krishnamurthy and Felisa Vazquez Abaci 



given by @, (J3]l with only one constraint, that is, L = 1: 



lim sup — I 

AT->oo TV 



N 



<7- 



(53) 



In the remainder of the section we will outline the steps involved in prov- 
ing the threshold structure of the optimal policy of the above constrained MDP 
and describe how the threshold structure can be exploited in the proposed weak 
derivative based stochastic approximation algorithm. 

Pr(a — 1; x, y) 



bi 



Buffer occupancy state 



Fig. 1 . The constrained average cost optimal policy u* ( [x b , x c , x y ] ) is a randomized mixture be- 
tween two policies that are deterministic and monotonically increasing in the buffer occupancy state 



The steps involved in proving the structural result for the considered countable 
state, infinite horizon average cost constrained MDP includes 

• Derive a condition for which all constrained policies induce a stable 
buffer and recurrent Markov chains. 

• Use the Lagrange multiplier formulation and prove the existence of an 
unconstrained optimal policy under the condition for buffer stability. 

• Prove the threshold structure of the unconstrained (Lagrangian cost) 
optimal policies by using the supermodularity concept. The structure 
of the constrained optimal policy then follows due to a well-known 
result that relates unconstrained and constrained optimal policies (8j 
0. 



The condition for buffer stability and recurrence of the Markov chains is as 
follows, see II22II for proof. 



Lemma 2. Denote min{/(x c )} =/; min (3(x c , 0) = fi. Ifj < 1— ^, then every 

policy u satisfying the constraint {53} induces a stable buffer, and a recurrent 
Markov chain. 
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Lagrange formulation and existence of an optimal policy 

We convert the constrained MDP to an unconstrained MDP by the Lagrange mul- 
tiplier method. In particular, for a Lagrange multiplier A, the instantaneous La- 
grangian cost at time n is c(X n , u; A) = c(J°, u) + \/3(X°,u). The Lagrangian 
average cost for a policy u is then given by 




and the corresponding unconstrained MDP is to minimize the above Lagrangian 
average cost. 

The existence and threshold structure of an unconstrained stationary average 
Lagrangian cost optimal policy are established by viewing the average cost MDP 
model as a limit of discounted cost MDPs with discount factors approaching 1. In 
particular, 126 ; 27! provide the theory for relating average cost optimal policies to 
discounted cost optimal policies. Define the discounted cost as below 

(55) 

where < v < 1 is the discount factor. Define the optimal discounted cost 
by V" (in) = inf J" (u; A) (for notational convenience we omit the notation of 

Lagrange multiplier A in V v (xq)). 

In 126; 27 1, the authors proved that the average cost optimal policy exists and 
inherits the structure of the discounted cost optimal policies under the following 
conditions: 

Al. For each state x and discount factor v, the optimal discounted cost V v {x) is 
finite. 

A2. Assume a reference state 0. There exists a nonnegative N such that —N< 
h u {x) = V v {x) - V u (0) for all x £ S and v £ (0, 1). 

A3. There exists nonnegative M x , such that h v (x) < M x for every x £ S and v. 
For every x there exists an action u(x) such that ^{x'\x, u(x))M x i < oo. 

x' 

Define the reference state by X° = [x b = 0, x c = Ck,x v =0]. In light of 
Lemma |2] it is clear that the policy of always transmitting whenever the buffer is 
not empty will induce a stable buffer, and hence finite expected time and cost for 
first passage to the reference state. As a result, due to Propositions 5(i) and 4(ii) 
in 11271 . Al and A3 hold. Furthermore, as all instantaneous costs are bounded, the 
following value iteration converges to the optimal discounted cost V v {-) for all 



J: (u;A)= lim E u 



^ v n 1 c(X n> u n ;X)\X = x 
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discount factor v 

K+i ( x ) = mi, 1 } Qn+i (x, u) (56) 



x (x,u) = c(x,u;X) + ^{x'\x,u)V^(x',a). (57) 



x'es 



Using the recursion (f56l> — (T57l> it can be shown by induction that V^([a; & , x c , x v ]) 
is increasing in x b and x v 11211 , which implies that A2 holds. 

Threshold structure of discounted/average cost optimal policies 

Due to convergence of (1561 . ( |57] i for all initial conditions, in order to show that 
the discounted cost optimal policy is monotonically increasing in the buffer state 
b it suffices to show Q°°([x b , x c , x v ],u) is submodular in (x b , u) for all x b > 1 
for some initial condition [29 1. This can be done via mathematical induction as in 
the theorem below, see 11221 for proof. 



Theorem 4. The discounted (Lagrangian) cost optimal policy is a threshold 
policy of the form 



„6 „c ^ n _ / if < x b < b(x c ,x y ) 
1 if b(x c ,x v ) < x b : 



<([x\x°,xy]) = i :f -_ v :"J ' (58) 



where b(x c , x v ) : C x y — > B defines the threshold for the pair of channel state 
and packet arrival event (x c , x v ). 

Therefore (unconstrained) Lagrangian average cost optimal policy, which in- 
herits the threshold structure of some sequence of discounted cost optimal poli- 
cies, is of the form d58l . Due to Theorem 4.3 in [81, the constrained optimal policy 
for the constrained MDP is a randomized mixture of two threshold policies: 

{0 if0<x b <b l {x c ,x v ) 
q(x c ,x y ) if bi(x c ,x v ) < x b < b 2 (x c ,x y ) 
1 ifx b >b 2 (x c ,x y ). 

(59) 

Here for each channel state x c € C and packet arrival state x v <G y, q{x c , x v ) <E 
[0,1] denotes the mixture probability and u^,U2 are monotone policies in the 
buffer state x b of the form d58l with threshold states bi(x c ,x v ) and b2(x c ,x v ), 
respectively. Therefore, the optimal policy u has a simple threshold structure. 
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7. Numerical Examples 

7.1. Comparison of Efficiency of Gradient Estimators 

Since efficient gradient estimation of the costs and constraints is an essential ingre- 
dient of the adaptive control algorithms proposed in this paper, in this subsection 
we compare the gradient estimators with other gradient estimators. 

Score Function Estimator of Bartlett & Baxter OS): The Score Function 
gradient estimator of (fT9] i can be derived using the measure-valued approach of 
II 131 . Let an arbitrary finite- valued random variable Z take value j with probability 
p e (j). Then 

^-F{Z) = ^tY^FQDpoU) = E (4~ HM3)] ) F (3)Pe(j)=E a [F(Z)S(9 ia ,Z)] 

ia la . . \ la 

A pi 

where S{9i a , Z) — \n[pg(Z)} is known as the Score Function. Returning to 
the Markov process {Z n }, using P(Z k+1 = (i,a)\Z k = (j,u)) = Aji(u)6 ia 
yields S(8i a , -Zfc+i) = j^l{z k+1 =(i,a)}< which is not uniformly bounded in 6: 
when one or more components of the control parameter tend to zero (which they 
do when a policy is pure instead of randomized) the estimator blows up. To over- 
come this problem a Score Function estimator is used in 1H [3] with the expo- 
nential parameterization. When inserted in the formula ( fT9] l for the chain rule, 
the Score Function estimator for the Markov Chain is of the form J2 n S(0ia, Z„) 
and it is a well known problem that the variance increases with time. Numerous 
variance reduction techniques have been proposed in the literature 1123 1 includ- 
ing regenerative estimation, finite horizon approximations and more recently (4] 
H] propose to use a forgetting factor for the derivative estimator. Their method 
suffers therefore of a variance/bias trade-off, while our estimation method is con- 
sistent and has uniformly bounded variance (in N). 

System Parameters: We simulated the following MDP: S = {1,2} (2 states), 
d(i) + 1 — 3 (3 actions), 

.. n , /0.9 0.A /0.3 0.7\ /0.5 0.5 

The action probability matrix (6(i, a)) and cost matrix (c(i, a)) were chosen as: 



(*(«») 



0.2 0.6 0.2 
0.4 0.4 0.2 



00») 



50.0 200.0 10.0 
3.0 500.0 0.0 



Gradient Estimates in Spherical Coordinates: The theoretical values of the 

( 45.05 — 55.07 \ 

gradients areV Q CAr(a)= " The gradient estimates d40b for 

& v ' \187. 58 —159.91/ 
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batch sizes N = 100, 1000 are: 



G 



1000 



/ 43.333 ±4.791 
Vl96.956± 8.951 

/ 44.322 ±1.323 
V 189.549± 2.829 



-54.191 ±2.096 
-161.200 ±4.918 

-53.656 ±0.712 
-164.329 ± 1.231 



In the above expression, the numbers following the ± sign are confidence inter- 
vals which were estimated at level 0.05 using the normal approximation with 100 
batches. 

WD Phantoms versus Score Function Method: As mentioned in Sec 12. 41 the 

closest approach to the algorithms in this paper is that in OO, which uses a Score 
Function method to estimate the gradients. Here we compare our gradient estima- 
tor with the score function gradient estimator of |4l |5J. Since the score function 
gradient estimator in O [3j uses canonical coordinates 9, to make a fair compari- 
son in this example we work with canonical coordinates. A detailed development 
of the appropriate modifications to the corresponding factors multiplying the re- 
alisation perturbations is in (TJ. The theoretical values of the generalized gradient 
([j"9l for the above MDP are 



v^[c(0WO)] 



-9.010 18.680 -9.670 
-45.947 68.323 -22.377 



(60) 



We simulated the WD phantom and score function (SF) estimators for (fT9l in 
canonical coordinates, see (TJ for implementation details. For batch sizes N = 
100 and 1000, the WD phantom gradient estimates are 



VC 



.WD 
100 



^C 1000 — 



-7.851 ± 0.618 17.275 ±0.664 -9.425 ±0.594 
-44.586 ± 1.661 66.751 ± 1.657 -22.164 ± 1.732 

-8.361 ± 0.215 17.928 ±0.240 -9.566 ±0.211 
-46.164 ± 0.468 68.969 ± 0.472 -22.805 ± 0.539 



Again the numbers after ± above, denote the confidence intervals at level 0.05 

WD 

with 100 batches. Note that for some of the elements of VC above, the confi- 
dence interval estimates do not contain the theoretical value (l60b . The variance of 
the WD phantom estimator is shown in TableQ] together with CPU time. 



N = 1000 


WD 

Var[VCjv ] 


i = 


1.180 


1.506 


1.159 


i = 1 


5.700 


5.800 


7.565 


CPU 


2 sees. 
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We implemented the score function gradient estimator of (4J [3] with the fol- 
lowing parameters: forgetting factor 1 (otherwise the estimates are biased), batch 
sizes of N = 1000 and 10000. In both cases a total number of 10, 000 batches 
were simulated. The score function gradient estimates are 



VC 



,SF 
10000 



VC 



r-.SF 

1000 



-3.49 ±5.83 16.91 ±7.17 
-41.20 ± 14.96 53.24 ± 15.0 

-6.73 ± 1.84 19.67 ±2.26 
-31.49 ±4.77 46.05 ±4.75 



-13.42 ±5.83 
12.12 ± 12.24 

-12.93 ± 1.85 
-14.55 ±3.88 



The variance of the score function gradient estimates are given Table|2] 



TV = 1000 


Var[VCV 




i = 
% = 1 


89083 
584012 


135860 
593443 


89500 
393015 


CPU 


1374 sees. 



N = 10000 


Var[VcS 


i = 


876523 


1310900 


880255 


i = 1 


5841196 


5906325 


3882805 


CPU 


13492 sees. 



Note that even with substantially larger batch sizes and number of batches 
(and hence computational time), the variance of the score function estimator is 
orders of magnitude larger than that of the WD phantom estimator. 



7.2. Stochastic Adaptive Control of time varying Constrained MDP 

We consider adaptive stochastic control of the following time-varying constrained 
MDP: For time up to 4000, S = {0, 1}, U, = {0, 1, 2},ieS (d(0) = d(l) = 2), 



A(0) = 



0.9 0.1 
0.2 0.8 



A(l) = 



0.3 0.7 



A(2) = 



0.5 0.5 



,0.6 0.4/ ' v ' \0.1 °- 9 y 
The cost matrix (c(i, a)), two constraints matrices (/3i(i, a)), (fad, a)) are 



(c(i,a)) 



50 200 10 
3 500 



20 100 -8 
-3 4 -10 



, $2 = 



10 -20 22 
-19 17 -15 



The optimal control policy incurs a cost of -111.80 (or equivalently a reward of 

"0 0.2 0.8' 



1 1 1.80) and is randomized with probabilities (fT8l 9* 



0.28 0.72 



For time 
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between 4000 and 12000 the transition probabilities are 

0.5 0.5" 
0.45 0.55 ' 

This has an optimal cost of -44.52 (that is reward of 44.52). 

The algorithm was initialized with randomized policy 8(0) = ^' ^ ^ ^ ^ . 

0.0 0.2 0.8 

The batch sizes over which the gradients are estimated was chosen as N = 10. 
The parameters used in the primal dual algorithm are p = 100, e = 2 x 10~ 7 (see 

As can be seen from Figffl it takes only around 100 batches (1000 time points) 
for the algorithm to rapidly approaches the optimal policy. The algorithm also 
quickly responds to the change in optimal policy at batch time 400. The choice of 
the discounting factor 5 in d52l of the primal dual method clearly shows the trade 
off between bias and tracking ability in Fig|2] For S = 1.0, the algorithm has fast 
tracking properties but a large bias. For 5 = 0.5 and 8 = 0.1 the bias gets smaller 
but the adaptation rate is slower. Our conference paper 1171 and report HI give 
additional numerical examples. 
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Fig. 2. Primal dual algorithm based stochastic adaptive controller 
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8. Conclusions 

In this paper simulation based gradient algorithms have been presented for adap- 
tively optimizing a constrained average cost finite state MDP. First a parameteriza- 
tion of the randomized control policy using spherical coordinates was presented. 
Then a novel measure-valued gradient estimator using WD phantoms for short 
batch sizes was presented. The ensuing algorithm for estimating gradients of the 
cost and constraints does not require explicit knowledge of the transition proba- 
bilities of the MDP or any off-line simulations. 

As illustrated in Sec 17. II the resulting gradient estimator has much smaller 
variance than score function based gradient estimators. The WD phantom gradient 
estimator was then used in a stochastic gradient algorithm with fixed step size in 
order to track time varying MDP with unknown transition probability matrices. 
Primal dual and multiplier based stochastic gradient algorithms were presented 
for handling the constraints. 



9. Appendix: Proof of Theorem [2] 

To relate our formula d4Qb to the realization perturbation formulas in |9 1 use the 
following argument. The matrix P{a) in denotes the transition probability 
of {Z n } using a fixed value of the control parameter a. Fix the indices i,p and 
consider a perturbation of a^ p into on tP + S. The new transition matrix is P(a + 
5) = P{a) + 8Q{a) + o{8 2 ), for Q{a) = P'(a) the matrix with entries equal to 
the derivatives of the entries of P{a) with respect to oti lP . In ||9j, it is established 
that 



where ir is the stationary state probability of the chain and the vector g has entries: 



d 



E <a) [c(Z)]=n(Q(a)g), 



(61) 



da 



g(i, u) = lim E„ V F(Z n ) | Z = (i, u) - (N - l)C(a). 

/V— vnn — J 
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Let p = 1, . . . , d(i) — 1, then by definition, the matrix Q satisfies: 



Q{j,u),{i 



d . . d 

P(j,u)(i\u'){a) =Pj,i'(u)- 



da 



Pj,i'(u) x < 



doLi 



Ip 

i'^i 
i' = i, u' <p — 1 

— 2 tan(ai p )f5j/ p _i i' = i,u'=p—l 
+2ta,n(a ip )9i^ u > i' = i,u'>p 



where we have used the expression 6^ u i — cos 2 (ai iU '+i) Ilm^i sin 2 (o;i )U '+i). 
Using now ttP = ir, and the expression for Q, ( f6Tb becomes: 



d 



da 



if) 



■E 7r ( Q )[c(Z)] = -2tan(aip)7r 4!tl /E4ff(«,p- 1) - g(i,Y p )]. 



As explained in ®B},g(i,p-l)-g(i,Y p ) = E^ST^ c(Z n )-u(k)C(a),vn& 
initial condition = (i,p — 1) and nu(k) = min{n > : Zk+ n = (i, Y p )}, so 
the resulting expression is: 



8 



dai 



■E ff(Q) [c(Z)] = - 2tan(aip)^, p _iE i 



k-\-nu(k) 

E Q [c(Z n ) - nu(k)G{a) \ Z k = (i,p-l)] 

n—k 

(62) 



for p = 1, . . . , d(i) — 1. When implementing the above equation on-line, we use 
the observations of the actual process {Z/,} to estimate the various derivatives. 
The proportion of time that we observe action p — 1 at state i is, by definition, 
f?j thus the averaging over JV observations yields 



d 



da 



1 * 

— E ff ( Q )[c(Z)] = - 2tan(a ip ) Jim — Y] E 



fc+i/(fc) 

*i,p(fc) ^ E a [c(Z n ) - u(k)C(a) 

n—k 

(63) 



The result (t40l > is established noticing that fjv converges a.s. to C(a), because 
for a <S a M E a [f(fc)] < oo, being a coupling time of a positive recurrent Markov 
chain. The case p = d(i) consists of a degenerate case and a similar chain rule 
argument can be shown for the factor K i d ^ (a, k). A detailed derivation of this 
formula can be obtained also using measure-valued differentiation l23lll3ll . and 
appears in (TJ. 
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10. Appendix: Proof of Proposition [O] 

Proof. First, from tightness of {X e (n)}, it follows that the family {A(e)} of (de- 
terministic) averages lies in a compact set, thus the accumulation points A exist. 

From the proof of uniform integrability in Proposition 11.11 and tightness 
of {A e (n)}, it follows that the sequence {(a c {n + 1) — a £ (n)/e)} is uni- 
formly integrable, which implies that {a e (n)} is tight. Therefore for any se- 
quence (a Cfc (•), A(efe)) there is at least one (weakly) convergent subsequence with 
a.s. Lipschitz continuous limit (refer to II19II ). For the rest of the proof, until 
specified, assume that e labels a weakly convergent subsequence (to avoid the 
cumbersome indexing). We will now identify the limits of such convergent sub- 
sequences and show that they all satisfy the same ODE. Also to ease the notation 
in the proof, call Y (n) = V^5(n), Y;(n) = V^B~i{n). 

From the definition (|24| | it follows that: 



a e (t + s) -a e (t) 



L(*+s)AJ-i 

E 

n=|t/e| 



r (n)+£;Ar(n + l)y»(r 



i=i 



Divide now the interval (t,t+ s] into subintervals of small size r5 e containing 
each a number n e of updates, as shown in Figure [3] 



i i i i i i i 



Fig. 3. 8 t = tn e . Condition: 8 t — > 0, n e — > oo as e — > 0. 



Using the S c grouping of subintervals, one obtains the telescopic sum: 



L(*+ S )/5 £ j-l 



(Z+l)n e -l 



Y (j)+^2xt(j+i)mj) 



i=i 



a £ (i+s)-a £ (i) = - S e xl— J2 

J=|i/*«J 3= ln ' 

We now show that if $ e (t) denotes the cr-algebra generated by the interpolated 
process up to time t, then: 

L(t+s)/5 e J-l / L \ 

5 e {h [a e (ln e )] + J2(H^ + HaVn e )})hi[a € (ln e ) + K[a e (lnm 
i=\t/s,\ \ 1=1 ) 

where the expectation of the absolute error in the end points of the S e dis- 
cretization vanishes as e — > 0. Let E;„ e denote the expectation condition- 
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ing on the information available up to the start of the current small subinter- 
val of size S e . Use now conditional expectations to express E[Yi(j) | $ e (t)} = 
E[E;„ e [Yi(j)] | $ e (t)], ln t < j < (I + l)n e for each term in the telescopic sum 
I = 0, . . . , L. That is, we use a filter of the terms, focusing on each of the averages 
within subintervals. 

Because one is interested in averages, any version of the process {o; e (?i)} can 
be used to characterize these conditional expectations. In particular, Skorohod 
representation establishes that there is a process a e (n) for each e in the weakly 
convergent subsequence, such that a e (n) has the same distribution as a e (n) and 
it converges with probability 1 to the same a.s. continuous limit a(t) (see 11191 ). 
Because a(t) is Lipschitz continuous w.p.l, ||a(W e + <5 C ) — a(W e )|| = 0(5 e ) and 
since a e (n) converges w.p. 1, it follows that sup ;ne<:)< ( ;+1 ) Ilc ||<S c (j) — a(/<5 £ )|| — > 
a.s. which implies that the underlying distribution of the batch estimators 
Yi(j),j = ...,(/ + l)n e — 1 converges to that of the fixed-a MDP at the 
parameter value a = a(lS e ). Using the fact that a e (ln e ) converges in distribution 
to a(lS e ), it follows that: 



(7+l)n £ -l 



1 £ 



j=ln e 



i=i 



= E,,,,. 



(2+l)n e -l 



- v .... 

n _ ^- — ^ 



j=ln £ 



V a C (i) + £(Af(j) + B(j)) V^BtU) 



1=1 



Given the initial state value (with the aggregated information about the living 
phantoms), and the value of A^(j), the expectation for the fixed a process of the 
gradient estimator V Q B;(j) is independent of Xf(J). As n c — > oo, the underlying 
process {Z n } will have the stationary distribution for the j-th estimation batch, so 
that: 



(i+l)n E -l 



1 E 



j=ln e 



i=i 



h [a e (ln e )]+J2 hi[a%ln e )}E lnc 



1=1 

L 



(Z+l)n e -l 



7 E m 



j=ln c 



pE„ ia) [Bi(n)V a Bi(n)] 



h [a e (ln e )} + ^(M e ) + pB[a e (ln £ )])hi[a e (ln e )} + n[a e {ln e )} 



1=1 



which establishes d64l l. Define now a piecewise constant function (on the 6 e 
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subintervals): 

L 

G e (a e (t), A(e)) = / l0 [a e (?n e )]+^(A i (e)+pB[a e (?n e )])/ li [a £ (/n e )]+«[« e (^ e )], 

1=1 

for 8 t < t < (I + l)5 e , then dH implies that: E[a c {t + s) - a £ (t) | m 
jt+s ge^pf^g^ X(e)] ds which implies that the limit process is a martingale with 
zero quadratic variation. For a detailed presentation of this methodology the 
reader is referred to 11191 . Taking now the limit along the weakly convergent sub- 
sequence, a e {t) — > a(t), A(e) — > A establishes the limiting ODE for this subse- 
quence. 

□ 
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